Abstract: A field-programmable multiprocessor integrated
circuit, PADDI (for Programmable Arithmetic Devices for 
High-Speed Digital Signal Processing), has been designed for 
the rapid prototyping of high-speed data paths typical to real- 
time digital signal processing applications. The processor ar- 
chitecture addresses the key requirements of these data paths: 
a) fast, concurrently operating, multiple arithmetic units, b) 
conflict-free data routing, c) moderate hardware multiplexing 
(of the arithmetic units), d) minimal branch penalty between 
loop iterations, e) wide instruction bandwidth, and f) wide I/O
bandwidth. The initial version contains eight processors con- 
nected via a dynamically controlled crossbar switch, and has a 
die size of 8.9 × 9.5 mm², in a 1.2-μm CMOS technology. With
a maximum clock rate of 25 MHz, it can support a computation 
rate of 200 MIPs and can sustain a data I/O bandwidth of 400 
megabytes/s with a typical power consumption of 0.45 W. An 
assembler and simulator have been developed to facilitate pro- 
gramming and testing of the chip. A software compilation path 
from the high-level data flow language SILAGE [15] to PADDI 
is currently under development, and handles partitioning, 
scheduling, and code generation. 



I. Introduction 

In many real-time digital signal processing systems,
tasks are computation intensive because high through- 
put is required, e.g., in real-time image processing and 
video applications, or because the tasks are complex, as 
in real-time speech recognition. Traditional microproces- 
sor-based architectures are often inadequate to meet the 
computation requirements, and so clusters of dedicated 
data paths, hard-wired to closely match the algorithmic 
data flow, are used instead. Such architectures typically 
contain multiple and concurrently operating data-path 
pipelines. 

The problem that we address is the rapid prototyping of 
computation-intensive DSP data paths. In real-time DSP 
applications the typical ASIC solutions take a long time 
to fabricate and test, and are not easily modified. A rapid 
prototyping capability will help alleviate these problems. 
In this paper we will first discuss the architectural features 
and computational requirements of high-speed DSP data
paths. We propose a novel approach for synthesizing these 
data paths, first discussed in [5]. We will describe the 
architecture, circuit design, and implementation of a pro- 
totype chip, PADDI (for Programmable Arithmetic De- 
vices for High-Speed Digital Signal Processing), which 
serves as proof of concept. We will give a brief overview 
of the grammar, assembler, and simulator which have 
been developed to assist the low-level programming of 
PADDI, and the CAD environment and tools being de- 
veloped for automatic compilation from a high-level lan- 
guage. We will also compare our approach with existing 
ones. 

A. Computation Requirements of High-Speed DSP 

The goal of this work is to define a set of high-level 
programmable macro components to support the rapid 
prototyping of real-time DSP data paths. Case studies of 
real-time algorithms and pipelined data-path architectures 
enable us to identify the following key architectural fea- 
tures which must be supported by these macro compo- 
nents: 

a) a set of concurrently operating execution units 
(EXU's) with fast arithmetic, to satisfy the high
computational requirements (hundreds of MOPs);

b) very flexible communication between the EXU's to 
support the mapping of a wide range of algorithms 
and to ensure conflict-free data routing for efficient 
hardware utilization; 

c) support for moderate (1-10) hardware multiplexing 
on the EXU's, for fast computation of tight inner 
loops; 

d) support for low overhead branching between loop 
iterations; 

e) wide instruction bandwidth; 

f) wide I/O bandwidth (hundreds of megabytes/s). 

The examples were drawn from real-time video, 
speech, and image processing applications. They in- 
cluded biquadratic filters (nonpipelined and pipelined 
[16], [21]), the RGB to YUV converter of [14], the 3 x 
3 image convolver and nonlinear sorting filters from [18] 
and [19], a memory controller for video coding [20], a 
dynamic time warp speech processor [12], and the word 
and grammar processing subsystems of a large vocabulary
real-time speech recognition system [4], [22].

Fig. 1. Total number of operations versus operation type.

TABLE I
Computations and I/O Summary

By careful analysis of these examples we were able to 
derive the essential features of our programmable archi- 
tecture. For example, by profiling the total number of oc- 
currences of various operations across the benchmark set, 
we were able to identify the critical operations that should 
be supported. The result is presented in Fig. 1, which also
lists each operation's share of all occurrences.
Clearly, by adopting the ten-percent rule, architectural
support for add/sub, shifts, comparisons, and 2-to-1
multiplexing is desirable.

Table I summarizes the computational and I/O require- 
ments of a few examples from [4], [14], [18], [19], and 
[22]. These numbers show that real-time DSP
applications place heavy demands on both computation
rate and I/O bandwidth.

B. Software-Configurable Hardware 

Fig. 2 shows the relative flexibility and performance of 
implementation approaches available to the system de- 
signer. Software-based microprocessor and digital signal 
processor approaches are very flexible, but often do not 
achieve the required performance. ASIC approaches often 
have high nonrecurring engineering (NRE) costs, and can
take considerable time (months) and effort to fabricate and
test. Software-configurable hardware combines the
flexibility of software approaches with the high performance
of hardware approaches. We will now describe the PADDI
software-configurable architecture created to fill this gap.

Fig. 2. Implementation alternatives.

II. PADDI Architecture and VLSI 
Implementation 

A. Overview 

The PADDI architecture provides a novel hardware 
platform for the rapid prototyping of algorithm-specific
high-speed data paths. PADDI is software configurable, 
which allows algorithms to be hardwired into the archi- 
tecture. 

The basic architecture of our prototype chip is outlined 
in Fig. 3. It contains a cluster of eight EXU's, each with 
its own local controller. (The complete architecture con- 
tains four of these clusters on a single chip.) The EXU's 
are connected by a dynamically configurable, crossbar 
communication network. The architecture addresses the
key requirements for rapid prototyping of dedicated
high-speed data paths as follows:

a) Each EXU contains dedicated hardware support for fast
arithmetic.

b) A crossbar switch that is under program control ensures
conflict-free data routing within a cluster of EXU's.
Global broadcasting from a single source is supported,
and the dynamic nature of the interconnect ensures that
multiple sources can be merged at a single destination.

c) The combination of flexible local interconnect,
distributed memory (in the form of register files), and
local controllers supports direct mapping of flow graphs
to EXU's, and the multiplexing of more than one
operation onto a given EXU.

d) By using multiple EXU's to achieve high computation
rates, rather than superpipelining a single or a few EXU's,
and by providing appropriate logic, a low branch penalty
is achieved for conditional branches and between loop
iterations.

e) High instruction bandwidth is ensured by assigning to
each EXU its own dedicated controller.

f) One hundred and twenty-eight dedicated I/O pins allow
EXU clusters to communicate with other clusters with
high bandwidth (400 megabytes/s).

The component parts of the architecture will be described
in the following sections.
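
The headline figures quoted for the prototype follow from simple arithmetic, sketched below. The function and constant names are illustrative only, and the sketch assumes that each EXU completes one operation per clock cycle and that each of the 128 I/O pins carries one bit per clock cycle; the text implies these assumptions but does not state them.

```python
# Back-of-the-envelope check of the figures quoted in the text.
# Names are illustrative; only the numeric values come from the paper.

CLOCK_HZ = 25_000_000   # maximum clock rate (25 MHz)
NUM_EXUS = 8            # EXU's in the prototype cluster
IO_PINS = 128           # dedicated I/O pins


def peak_mips(num_exus: int, clock_hz: int) -> float:
    """Peak rate in MIPs, assuming one operation per EXU per cycle."""
    return num_exus * clock_hz / 1e6


def io_bandwidth_mbytes(io_pins: int, clock_hz: int) -> float:
    """Peak I/O bandwidth in megabytes/s, assuming one bit per pin per cycle."""
    return io_pins * clock_hz / 8 / 1e6


if __name__ == "__main__":
    print(peak_mips(NUM_EXUS, CLOCK_HZ))            # 200.0
    print(io_bandwidth_mbytes(IO_PINS, CLOCK_HZ))   # 400.0
```

Under the same assumptions, doubling the number of EXU's at the same clock rate doubles the peak rate, which is consistent with the 16-EXU (400 MIPs) figure given in the Conclusion.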

B. EXU Architecture 

Fig. 4 shows the internal architecture of an EXU. Two
16-b-wide register files, each containing six registers, are
used for the temporary buffering of data. The files are
dual-ported for simultaneous read and write operations.
They can also be configured as delay lines to support
pipelining and retiming of operations. In each register file,
one of the six registers is configured as a scan register. It
can be initialized to contain an arbitrary value (for the
implementation of constant variables), read and written as
a regular register, or used for scan testing. A fast
carry-select adder and logarithmic shifter are used to implement
the arithmetic functions, and hardwired logic provides
single-cycle comparison, and maximum and minimum
functions. Arithmetic can be performed in two's complement
or unsigned format with automatic saturation. An
optional pipeline register is available at the output of each
EXU. By using the register, the user can increase the
maximum sampling rate by overlapping EXU operations
with data transmission over the network. This can be useful
in applications where the additional latency has no
negative effect. However, if the operation is in a feedback
loop, the additional pipeline register would normally not
be used. A status flag (a ^ b) is available to other EXU's
and the external world to affect local and global program
control flow, respectively. The EXU's normally provide
16-b accuracy, but two can be concatenated for increased
32-b accuracy. The register delay, type of arithmetic, output
pipeline register, and EXU linking are controlled by
mode bits set by the user.

Fig. 3. Prototype architecture.

Fig. 4. EXU architecture.
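
The saturating arithmetic mentioned above can be illustrated with a short sketch. It uses the conventional definition of saturation (clamping a result to the representable range instead of letting it wrap around); the text does not describe the exact hardware behavior, so the function name `saturating_add` and its parameters are assumptions made for illustration.

```python
def saturating_add(a: int, b: int, signed: bool = True, width: int = 16) -> int:
    """Add two integers and clamp the sum to the range of a `width`-bit word.

    signed=True models two's-complement format, signed=False unsigned format.
    width=32 models two linked EXU's (an assumption about how linking behaves).
    """
    if signed:
        lo, hi = -(1 << (width - 1)), (1 << (width - 1)) - 1
    else:
        lo, hi = 0, (1 << width) - 1
    # Clamp instead of wrapping around on overflow or underflow.
    return max(lo, min(hi, a + b))


if __name__ == "__main__":
    print(saturating_add(30000, 10000))                 # 32767
    print(saturating_add(-30000, -10000))               # -32768
    print(saturating_add(60000, 10000, signed=False))   # 65535
```

Saturation is generally preferred in signal-processing data paths because an overflow that clamps produces a bounded error, whereas a wrap-around produces a large sign-inverted one.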

C. Communication Network 

To ensure flexible, conflict-free and high-bandwidth 
data routing, a crossbar network has been selected to
interconnect the processors. This network routes both data
and status flags. The data routing is under program
control and can be changed in each program cycle; the 
routing of the status flags is static and set at compile time. 
Static flag routing was chosen as a reasonable compro- 
mise between flexibility and hardware efficiency. 

The main challenge in the design of the crossbar network
is to ensure pitch matching between the crossbar
switches and the EXU's. Therefore, a layered crossbar 
structure has been developed as shown in Fig. 5. A detail 
of the data-routing bit slice that connects EXU's A and E 
to each other, to other EXU's in the cluster, and the I/O 
buses is pictured. The layered switch implementation is 
organized as follows. The Type I switch connects the input
of an EXU either to one of its neighbors (B, C, D for
EXU A) or, via a Type II switch, to the I/O buses or the
other half of the cluster. The squares and the circles represent
inputs and outputs to the switches, respectively. A Type 
II switch is detailed. Data lines are run horizontally and 
control lines vertically. The major advantage of the pro- 
posed approach is that it allows all horizontal buses to fit 
within the pitch provided by the EXU's and hence save a 
substantial amount of area. Finally, in order to make the 
design even denser, the switches are implemented using 
NMOS pass transistors only. Weak PMOS feedback tran- 
sistors restore the weak high level passed by the NMOS 
pass transistors and improve noise margin. 

Because extra decoupling buffers were avoided, the sizing of
the feedback transistors for the Type I and Type II switches
is coupled and had to be solved jointly. Fig. 6 shows a circuit diagram of both
switches. Fig. 7 shows SPICE switching waveforms of 
nodes A, B, and C of Fig. 6 with W/L of the feedback
PMOS transistors as a parameter. In the case where W/L 
is zero, no feedback transistor is present, and node B is 
never pulled to rail. In the case where W/L is 8/2, node 
C is never pulled to ground because the inverter driving 
node C is not strong enough to counter the feedback tran- 
sistor of the Type I switch. A reasonable compromise is 
to choose W/L = 3/3. 

D. Control 

Each EXU requires a 53-b horizontal control word, and
so the overall instruction bandwidth for all eight EXU's
is quite high. In order to simultaneously achieve both a
high instruction and data bandwidth, the control strategy
shown in Fig. 8 was used. At run time, an external
sequencer broadcasts a 3-b global instruction to each EXU,
which is locally decoded into a 53-b instruction word. In
this fashion, a 3-b word is used to specify a 424-b very
long instruction word (VLIW). The architecture is SIMD
in that each EXU receives the same global instruction, but
MIMD in that each decoded instruction is unique to the
associated EXU.

IEEE JOURNAL OF SOLID-STATE CIRCUITS. VOL. 27. NO. 12. DECEMBER 1992

CHEN AND RABAEY: RECONFIGURABLE MULTIPROCESSOR IC FOR PROTOTYPING OF DATA PATHS

Fig. 5. Simplified schematic of crossbar switch.

Fig. 6. Type I and Type II switches with weak PMOS pull-ups.

Fig. 7. Type I and II switching waveforms.

Fig. 8. Basic control strategy.
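
The two-level control scheme can be sketched in a few lines. In this illustrative model each EXU owns a small lookup table (its nanostore) with eight entries, and the 3-b global instruction is simply the index used by every EXU at once. The names, the list-based representation, and the placeholder contents are assumptions; the real control words are 53-b fields whose encoding is not given in the text.

```python
from typing import List

NUM_EXUS = 8            # EXU's per cluster (from the text)
NANOSTORE_WORDS = 8     # words per nanostore, selected by the 3-b global instruction
LOCAL_WORD_BITS = 53    # width of each EXU's local control word


def decode_global_instruction(global_instr: int,
                              nanostores: List[List[int]]) -> List[int]:
    """Return the local control word that each EXU reads for one global instruction.

    Every EXU receives the same index (SIMD-like), but each looks it up in its
    own nanostore, so the decoded words may all differ (MIMD-like).
    """
    if not 0 <= global_instr < NANOSTORE_WORDS:
        raise ValueError("global instruction must fit in 3 bits")
    return [store[global_instr] for store in nanostores]


def vliw_width_bits() -> int:
    """Total width of the instruction word specified by one global instruction."""
    return NUM_EXUS * LOCAL_WORD_BITS


# Hypothetical nanostore contents: word w of EXU e holds the placeholder 100*e + w.
example_stores = [[100 * e + w for w in range(NANOSTORE_WORDS)]
                  for e in range(NUM_EXUS)]

if __name__ == "__main__":
    print(vliw_width_bits())                                # 424
    print(decode_global_instruction(3, example_stores))
```

The point of the scheme is that only three bits cross the chip boundary per cycle, while the wide instruction words stay local to each EXU.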

Status information can be communicated between the 
EXU's and the external controller to affect both the local 
and global control flows. Each EXU instruction contains 
two interrupt state fields. By setting these, the EXU can 
accept interrupts from other EXU's or the external world 
and then vector to the appropriate address contained in its
precompiled interrupt vectors. 

The decoders of each EXU are SRAM-based nano- 
stores that are configured at setup time. Fig. 9 shows a 
section of the SRAM. Each SRAM contains eight words, 
which allows eight different operations to be multiplexed 
on its associated EXU. Master-slave scan latches at its 
I/O are connected to a global serial shift register (scan 
chain) to allow serial configuration of the SRAM. The 
SRAM is implemented using a conventional six-transistor 
cell and operates with two-phase nonoverlapping clocks. 

E. Configuration 

All chip configuration registers (e.g., constants, mode 
bits, interrupt vectors) and the SRAM scan latches are 
connected as a serial shift register. Only a few pins are 
needed to configure the chip using this scheme. Eight 
scans, one for each word, are required to completely load 
the nanostores. Two on-board FSM's generate the necessary
clock, interface, and internal control signals that
allow the chip to boot directly from standard EPROM's
without the need for additional glue logic. One FSM generates
divided-down scan clocks, which allow the use of
standard (slow) EPROM's for booting. The other generates
control signals for scanning and writing each individual
line, and for verifying the contents of the nanostores.
On-board counters indicate when a scan is completed and 
keep track of the current nanostore word being written. 
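
The serial configuration scheme can be modeled as a shift register that is filled one bit per scan clock and then written into the current nanostore word, repeated once per word. The sketch below is a simplified model only: the chain length, the bit ordering, and the function names `shift_in` and `load_nanostore` are assumptions, and the real chain also includes constants, mode bits, and interrupt vectors.

```python
from typing import List


def shift_in(chain_length: int, bits: List[int]) -> List[int]:
    """Shift `bits` serially into a chain of latches and return its final state.

    Each new bit enters at position 0 and pushes earlier bits one place along.
    """
    chain = [0] * chain_length
    for bit in bits:
        chain = [bit] + chain[:-1]
    return chain


def load_nanostore(words: List[List[int]]) -> List[List[int]]:
    """Load a nanostore with one scan per word, as described in the text."""
    store = []
    for word_bits in words:
        # Shift the last bit in first so the chain ends up in word order.
        chain = shift_in(len(word_bits), list(reversed(word_bits)))
        store.append(chain)   # write the scanned line into the current word
    return store


if __name__ == "__main__":
    print(shift_in(3, [1, 0, 0]))                    # [0, 0, 1]
    print(load_nanostore([[1, 0, 1], [0, 1, 1]]))    # [[1, 0, 1], [0, 1, 1]]
```

With eight words per nanostore, this loop runs eight times, matching the eight scans mentioned above.
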
Fig. 10. EXU critical path.

III. Simulation and Testing 
A. Simulation 

SPICE simulations were performed to simulate the crit- 
ical path of the chip. The respective load capacitances 
were estimated from the worst-case IC process parameters 
and incorporated into the SPICE decks. The critical path 
simulation results for an EXU and a four-quadrant chip 
are shown in Figs. 10 and 11 respectively. The units 
shown are in nanoseconds. The critical path begins from 
the issue of a read address to register file B. A delay of 
24 ns is incurred during EXU transit from the register file, 
through the shifter and inversion logic, through the carry 
path of the carry-select adder, and through the saturation
logic to the output of the EXU. An additional 15 ns is lost
during transit through the crossbar networks, after which
2 ns are required to latch the data into the input of the
target EXU. The total simulated critical delay is 41 ns.

Fig. 12. Chip photograph.

The chip was implemented in 1.2-μm CMOS technology
and was fully functional on first silicon when tested.
Fig. 12 shows a photograph of the chip and Table II sum- 
marizes the chip characteristics. 

B. Testing 

Fig. 13 shows the assembly code and scope trace for a 
mod 3 counter that cycles through 0, 1, and 2 at 25 MHz.

We note that the prototype chip, which contains a sin- 
gle quadrant, runs at a maximum clock frequency of 25 
MHz with a critical path delay of 40 ns. This indicates 
that there is excellent agreement with the SPICE simula- 
tions. The only major difference between a single quad- 
rant and one with four quadrants is the additional inter- 
quadrant transit time which, with proper buffering, can be 
limited to 1 or 2 ns. 
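
The agreement between simulation and measurement can be checked numerically. The sketch below sums the three simulated delay components reported in the Simulation subsection and converts a critical-path delay into the corresponding maximum clock rate; the helper name `max_clock_mhz` is illustrative.

```python
def max_clock_mhz(critical_path_ns: float) -> float:
    """Maximum clock rate in MHz for a given critical-path delay in ns."""
    return 1e3 / critical_path_ns


# Simulated components from the text: EXU transit, crossbar transit, input latch.
simulated_ns = 24 + 15 + 2   # 41 ns in total
measured_ns = 40             # implied by the measured 25-MHz maximum clock

if __name__ == "__main__":
    print(simulated_ns)                            # 41
    print(round(max_clock_mhz(simulated_ns), 2))   # 24.39
    print(max_clock_mhz(measured_ns))              # 25.0
```

The simulated and measured delays differ by 1 ns, or about 2.5% of the measured value.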



Fig. 13. 25-MHz mod 3 counter. 

The SFG of a low-pass biquadratic filter is shown in 
Fig. 14. The multiplying coefficients were converted to a 
canonic signed digit format to minimize the number of 
nonzero bits and transformed into shifts and adds (Fig. 
15). A processor schedule using three EXU's and three 
instructions is shown in Fig. 16. Fig. 17 shows the assem- 
bly code (mapped to three units and three instructions). 
Fig. 18(a) and (b) shows plots of the acquired impulse 
response and the corresponding frequency response, re- 
spectively. The arithmetic mode is 16-b two's comple- 
ment and the impulse is input at bit 13. The measured 
results agree well with simulations. Due to limitations of
the signal analyzer in acquiring data, the maximum clock
rate of this biquad was constrained to 10 MHz.

Fig. 14. Simple low-pass biquadratic filter.

Fig. 15. Transformed biquad.

Fig. 16. Biquad processor schedule.
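
The conversion of a multiplying coefficient into shifts and adds can be sketched as follows. This uses the standard canonic signed digit recoding (digits in {-1, 0, +1} with no two adjacent nonzero digits) and is not taken from the PADDI tools. For simplicity it treats the coefficient as an integer and uses left shifts, whereas the fractional filter coefficients of the biquad would map to right shifts; the function names are illustrative.

```python
from typing import List


def to_csd(n: int) -> List[int]:
    """Return the canonic signed digit form of n, least significant digit first."""
    digits = []
    while n != 0:
        if n & 1:
            d = 2 - (n & 3)   # +1 if n mod 4 == 1, -1 if n mod 4 == 3
            n -= d
        else:
            d = 0
        digits.append(d)
        n >>= 1
    return digits


def csd_multiply(x: int, coeff: int) -> int:
    """Multiply x by coeff using only shifts, adds, and subtracts."""
    acc = 0
    for shift, d in enumerate(to_csd(coeff)):
        if d:
            acc += d * (x << shift)   # d is +1 (add) or -1 (subtract)
    return acc


if __name__ == "__main__":
    print(to_csd(7))             # [-1, 0, 0, 1], i.e. 8 - 1
    print(csd_multiply(13, 7))   # 91
```

Here the coefficient 7 needs one subtraction (8x - x) instead of two additions (4x + 2x + x), which is why the recoding reduces the number of EXU operations per multiplication.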

Table III shows several other benchmarks which can be 
implemented on this architecture using several PADDI 
chips. 

IV. Programming Tools for PADDI 

The low-level programming tools, the pas assembler 
and psim simulator, provide the foundation for the
higher-level synthesis tools. 



DEFAULTS { 

A6 a 0, B6 m 0, SIGHED, UNUNK 
NORMAL_A, NORMALJJ, 
BFSWnll 1111 11b, 
IEM1.0.IEN2-0) 

EXU_A{ 

1:<A8*(B8 ►»«)); 
2: <B6); 

3: AS ■ 121, B6 a EXUA, (AC);} 

EXU_B{ 

1 : A4 » EXU.A, (A6 ♦ B8 » 2); 

2: BB = EXU_A, (A4 ♦ B6); 

3: A6 n EXU_C, B6 a EXU_C, (B8);) 

EXU_C{ 

1 : A2 « EXU_B, B2 = EXU_C, (A*<BS»2»; 

2: A3 =» EXU_B, S3 « EXU_C, (A2 - B2); 

3: A6 = EXU.B, B0 a EXU_B, (A3 «• B3), 01 ;) 

Fig. 17. Biquad assembly code. 

A. The Pas Assembler 

Pas is the lowest-level software interface between
the programmer and the PADDI architecture, providing
a method for describing algorithms. The pas assembly
language was designed and implemented with the
interconnection network of the PADDI architecture in 
mind: programs written in it can easily exploit intercom- 
munication between execution units. The intercommuni- 
cation follows a "receiver-controlled" model in which 
the receiving unit controls the routing of the actual com- 
munication while the broadcasting unit only concerns it- 
self with the data or flag to be communicated (except when 
broadcasting to the external world; in this case the broad- 
caster must specify which output bus to employ). In ad- 
dition to being able to express all available PADDI op- 
erations in a convenient C-like syntax, the assembler also 
allows for the explicit specification of instructions within 
the nanostores at the individual bit level. 

B. The Psim Simulator 

Psim serves as a tool for simulating and debugging
multiple-chip PADDI algorithms in software. It consists 
of a simulation engine coupled with an X-based graphical 
user interface (GUI). The simulation engine can operate 
either as a "black box," allowing it to interface with external
software tools, or as a stand-alone simulation environment
when coupled with the X-based GUI. The
stand-alone simulation environment supports many of the 
common debugging features, including single-stepped ex- 
ecution and the ability to modify registers and instructions 
"on the fly." 

C. Software Compilation 

An automated compilation path (Fig. 19) from the
high-level data flow language Silage [15] to the PADDI chip,
which includes partitioning, scheduling, and code
generation, is currently under development. A paper detailing
the compiler design has been submitted to [6].

TABLE III
Benchmarks

Benchmark                                         Possible Sampling Rate   EXU's Required
3x3 Linear Convolver (image processing)           25 MHz                   11
3x3 Nonlinear Sorting Filter (image processing)   25 MHz                   16
RGB Video Matrix Converter                        25 MHz                   31
Flexible Memory Control Chip for Video Coding     25 MHz                   28
Biquad, Direct Form II (time-multiplexed)         5 MHz                    9
Biquad, Direct Form II (pipelined)                25 MHz                   55

V. Limitations of Other Approaches 

Numerous FPGA's are available, with
offerings from XILINX [11], Actel [1], Plus Logic [17],
Plessey [9], Algotronix [8], AT&T [10], and others. Due
to their bit-level granularity, these FPGA's cannot route
wide data buses as flexibly, or provide adders as fast (for
the same technology), as a word-level granular architecture
with flexible bus interconnections and adders optimized
for speed. FPGA's also do not
typically support hardware multiplexing of their CLB's.
In software-configurable FPGA's, the functions of the 
CLB's can be redefined, but this is not typically done in 
high-speed applications since reconfiguration time is of 
the order of milliseconds. In order to illustrate these 
points, we have mapped several benchmarks to the pop- 
ular XILINX XC3090 family. 

The results are shown in Table IV, case A. The FPGA 
speed numbers are optimistic because no account is taken
of routing delays. Because of the ability to have faster
arithmetic for the same technology, more flexible inter- 
connectivity of data buses, support for hardware multi- 
plexing, and more efficient implementation of register 
files, PADDI is better suited for data-path-intensive ap- 
plications. 

When we started our investigations only the XC3000 
family was available. The recently introduced XILINX 
XC4000 series, which uses 0.8- and 0.5-μm CMOS technology
and which has hooks for faster adders and a
ferent interconnect architecture, will affect the above 
comparison. 

As a further comparison, a Motorola DSP56000 can 
operate at 10.25 MIPs and has a data I/O bandwidth of 
60 megabytes/s. For video sampling rates, this translates
to roughly two available instructions per sample. This type
of limited performance and I/O capability clearly limits
the applicability of general-purpose DSP's to real-time
DSP applications.
Fig. 19. Software environment.

TABLE IV
Comparison of XILINX and PADDI

3x3 Linear Convolver    144    40    11    (3 chips)

VI. Conclusion 

The architecture and implementation of a reconfigurable
multiprocessor IC for rapid prototyping of real-time
data paths has been described. The chip targets high-per- 
formance digital signal processing applications. A 16 
EXU (400 MIPs) processor is currently under design, to- 
gether with a multichip module approach, which could 
support up to 32 EXU's (800 MIPs) in a single package.
Video signal processors (VSP's) [2], [3], [7], [13], [23]
usually have a few complex and highly pipelined arith- 
metic logic units with on-board support for memory, and 
are not designed to support conditional branching effi- 
ciently. The example of [23] can operate three 12-b exe- 
cution units at 27 MHz (81 MIPs) with a data I/O band- 
width of 405 megabytes/s, but typically operates at 13.5 
MHz (40 MIPs) due to the latency of the long pipelines. 
The chip presented here can operate eight 16-b execution 
units at 25 MHz (200 MIPs) with a data I/O bandwidth 
of 400 megabytes/s. Because of the larger degree of con- 
currency due to the smaller level of granularity of the 
EXU's, and smaller branch penalty, PADDI is better 
suited for data-path prototyping. 